ing the Needleman-Wunsch algorithm and the Smith-Waterman algorithm. The two groups of scores show a significant difference. Therefore, the alignment-based sequence comparison approach does have the discrimination power to separate species based on the sequence homology alignment scores. This also demonstrates that the sequence structure information determines the species differentiation.

The alignment results of aligning MT042778 (SARS-CoV-2) with AB889999

V) and of aligning MT042778 with QANK01002681 (Yersinia pestis) using the alignment-based sequence comparison approach, i.e., the Needleman-Wunsch algorithm and the Smith-Waterman algorithm.

                          Align with AB889999                 Align with QANK01002681
                          Needleman-Wunsch  Smith-Waterman    Needleman-Wunsch  Smith-Waterman
Alignment length          143               113               327               259
Identical pairs           109               109               167               129
Identity percentage (%)   76.2              96.5              51.1              49.8
Gap percentage (%)        21.7              0.9               42.8              20.5
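The quantities reported in the table (alignment length, identical pairs, identity percentage, gap percentage) can be derived from any pairwise alignment. The following is a minimal sketch of a Needleman-Wunsch global alignment with those summary statistics. The scoring values (`match=1`, `mismatch=-1`, `gap=-1`) and the function names are illustrative assumptions, not the settings used to produce the table. The dynamic programming matrix has one cell per pair of positions, which is why the cost grows with the product of the two sequence lengths.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b; returns the two gapped strings.

    The scoring parameters are illustrative assumptions.
    """
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def alignment_stats(aln_a, aln_b):
    """Summarise a gapped alignment with the quantities used in the table."""
    length = len(aln_a)
    identical = sum(1 for x, y in zip(aln_a, aln_b) if x == y and x != "-")
    gaps = sum(1 for x, y in zip(aln_a, aln_b) if x == "-" or y == "-")
    return {
        "alignment_length": length,
        "identical_pairs": identical,
        "identity_pct": 100.0 * identical / length,
        "gap_pct": 100.0 * gaps / length,
    }


aligned = needleman_wunsch("ACGT", "AGT")
stats = alignment_stats(*aligned)
```

The identity percentage is the number of identical pairs divided by the alignment length; for the first column of the table, 109 / 143 gives 76.2%.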

The k-mers approach

Aligning sequences using the homology alignment approaches, either global or local, is very costly, as mentioned above. The alignment-free sequence comparison approach is therefore of great interest. Most alignment-free sequence comparison approaches explore the pattern of the sequence statistics for sequence comparison and are still widely used in applications [Lichtblau, 2019; Randhawa, et al., 2020; Rohling, et al., 2020].
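As an illustration of sequence statistics that need no alignment, the sketch below counts overlapping k-mers in a single pass over each sequence and compares two sequences through their k-mer frequency vectors. The function names, the DNA alphabet `"ACGT"`, the choice of `k`, and the Euclidean distance are assumptions made for this example; they are not taken from the cited works.

```python
import math
from collections import Counter
from itertools import product


def kmer_counts(seq, k):
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_frequency_vector(seq, k, alphabet="ACGT"):
    """Relative k-mer frequencies in a fixed order over the alphabet.

    The vector has len(alphabet) ** k entries. k-mers containing symbols
    outside the alphabet (e.g. 'N') count towards the total but have no
    entry of their own.
    """
    counts = kmer_counts(seq, k)
    total = max(len(seq) - k + 1, 1)  # avoid division by zero for short input
    return [counts["".join(p)] / total for p in product(alphabet, repeat=k)]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


counts = kmer_counts("ACGTAC", 2)
vec_a = kmer_frequency_vector("ACGTAC", 2)
vec_b = kmer_frequency_vector("TTTTTT", 2)
distance = euclidean(vec_a, vec_b)
```

Counting takes time linear in the sequence length, in contrast to the quadratic cost of filling a pairwise alignment matrix.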

One of the basic principles of the alignment-free sequence comparison approach is to use the pattern of the sequence statistics to represent sequences [Le and Huynh, 2019; Nguyen, et al., 2019; Guo, et al., 2020]. This process is called a feature extraction process. Suppose the nth sequence is denoted by $\mathbf{s}_n$. The mapping between $\mathbf{s}_n$ and a feature vector is formulated by the following equation, whose symbols denote the alphabet (a set of the nucleic acids or a set of amino acids), the length of the nth sequence, and the feature space dimension,